Abstract: The main purpose of Feature Subset Selection is to find a reduced subset
of attributes from a data set described by a feature set. The task of a feature selection
algorithm (FSA) is to provide a computational solution motivated by a certain
definition of relevance or by a reliable evaluation measure. In this paper several
fundamental algorithms are studied to assess their performance in a controlled experimental
scenario. A measure to evaluate FSAs is devised that computes the degree of matching 
between the output given by a FSA and the known optimal solutions. An extensive 
experimental study on synthetic problems is carried out to assess the behaviour of 
the algorithms in terms of solution accuracy and size as a function of the relevance, 
irrelevance, redundancy and size of the data samples. The controlled experimental 
conditions facilitate the derivation of better-supported and meaningful conclusions. 

Keywords: Feature Selection Algorithms; Empirical Evaluations; Attribute relevance 
and redundancy. 



1 INTRODUCTION

The feature selection problem is ubiquitous in an inductive machine
learning or data mining setting and its importance is beyond doubt.
The main benefit of a correct selection is the improvement of the
inductive learner, either in terms of learning speed, generalization
capacity or simplicity of the induced model. On the other hand, there
are the scientific benefits associated with a smaller number of
features: a reduced measurement cost and hopefully a better
understanding of the domain. A feature selection algorithm (FSA) is a
computational solution that should be guided by a certain definition
of subset relevance, although in many cases this definition is
implicit or followed in a loose sense. This is so because, from the
inductive learning perspective, the relevance of a feature may have
several definitions depending on the precise objective that is sought
(Caruana and Freitag, 1994). Thus the need arises for common-sense
criteria that make it possible to decide adequately which algorithm to
use (or not to use) in certain situations.

This work reviews the merits of several fundamental feature subset
selection algorithms in the literature and assesses their performance
in an artificial controlled experimental scenario. A scoring measure
computes the degree of matching between the output given by the
algorithm and the known optimal solution. This measure ranks the
algorithms by taking into account the amount of relevance, irrelevance
and redundancy on synthetic data sets of discrete features. Sample
size effects are also studied.

The results illustrate the strong dependence on the particular
conditions of the algorithm used, as well as on the amount of
irrelevance and redundancy in the data set description, relative to
the total number of features. This should prevent the use of a single
algorithm, especially when there is poor knowledge available about the
structure of the solution. More importantly, it points in the
direction of using principled combinations of algorithms for a more
reliable assessment of feature subset performance.

The paper is organized as follows: we begin in Section 2 by reviewing
relevant related work. In Section 3 we set a precise definition of the
feature selection problem and briefly survey the main categorization
of feature selection algorithms. We then provide an algorithmic
description of, and comment on, several of the most widespread
algorithms in Section 4. The methodology and tools used for the
empirical evaluation are covered in Section 5. The experimental study,
its results and some general advice to the data mining practitioner
are developed in Section 6. The paper ends with the conclusions and
prospects for future work.



2 MOTIVATION AND RELATED WORK 

Previous experimental work on feature selection algorithms for
comparative purposes includes Aha and Bankert (1995), Doak (1992),
Jain and Zongker (1997), Kudo and Sklansky (1997) and Liu and Setiono
(1998b). Some of these studies use artificially generated data sets,
like the widespread Parity, Led or Monks problems (Thrun, 1991).
Demonstrating improvement on synthetic data sets can be more
convincing than doing so in typical scenarios where the true solution
is completely unknown. However, there is a consistent lack of
systematic experimental work using a common benchmark suite and
equal experimental conditions. This hinders a wider 
exploitation of the power inherent in fully controlled 
experimental environments: the knowledge of the (set of) 
optimal solution(s), the possibility of injecting a desired 
amount of relevance, irrelevance and redundancy and the 
unlimited availability of data. 

Another important issue is the way FSA performance is assessed. This
is normally done by handing over the solution encountered by the FSA
to a specific inducer (during or after the feature selection process
takes place). Leaving aside the dependence on the particular inducer
chosen, there is a much more critical aspect, namely, the relation
between the performance as reported by the inducer and the true merits
of the subset being evaluated. In this sense, it is our hypothesis
that FSAs are strongly affected by finite sample sizes, which distort
reliable assessments of subset relevance, even in the presence of a
very sophisticated search algorithm (Reunanen, 2003). Therefore,
sample size should also be a matter of study in a thorough
experimental comparison. This problem is aggravated when using filter
measures, since in this case the relation to true generalization
ability (as expressed by the Bayes error) can be very loose
(Ben-Bassat, 1982).



A further problem with traditional benchmarking data 
sets is the implicit assumption that the used data sets are 
actually amenable to feature selection. By this it is meant 
that performance benefits clearly from a good selection 
process (and less clearly, or even worsens, with a bad one).
This criterion is not commonly found in similar experimen- 
tal work. In summary, the rationale for using exclusively 
synthetic data sets is twofold: 

1. Controlled studies can be developed by systematically 
varying chosen experimental conditions, thus facilitat- 
ing the derivation of more meaningful conclusions. 

2. Synthetic data sets allow full control of the experimental
conditions, in terms of amount of relevance, irrelevance and
redundancy, as well as sample size and problem difficulty. An added
advantage is the knowledge of the set of optimal solutions, in which
case the degree of closeness to any of these solutions can be
assessed in a confident and automated way.

The procedure followed in this work consists in generating sample data
sets from synthetic functions of a number of discrete relevant
features. These sample data sets are then corrupted with irrelevant
and/or redundant features and handed over to different FSAs to obtain
a hypothesis. A scoring measure is used in order to compute the degree
of matching between this hypothesis and the known optimal solution.
The score takes into account the amount of relevance, irrelevance and
redundancy in each suboptimal solution as yielded by an algorithm.

The main criticism associated with the use of artificial data is the
likelihood that such a problem be found in real-world scenarios. In
our opinion this issue is more than compensated by the mentioned
advantages. A FSA that is not able to work properly in simple
experimental conditions (like those developed in this work) is
strongly suspected of being inadequate in general.



3 THE FEATURE SELECTION PROBLEM 

Let $X$ be the original set of features, with cardinality $|X| = n$.
The continuous feature selection problem (also called Feature
Weighing) refers to the assignment of weights $w_i$ to each feature
$x_i \in X$ in such a way that the order corresponding to its
theoretical relevance is preserved. The binary feature selection
problem (also called Feature Subset Selection) refers to the choice of
a subset of features that jointly maximize a certain measure related
to subset [...] the continuous problem solution. Although both types
can be seen in a unified way (the latter case corresponds to the
assignment of weights in $\{0, 1\}$), these are quite different
problems that reflect different design objectives. In the continuous
case, one is interested in keeping all the features but in using them
differentially in the learning process. On the contrary, in the binary
case one is interested in keeping just a subset of the features and
(most likely) using them equally in the learning process.

A common instance of the feature selection problem can be formally
stated as follows. Let $J$ be a performance evaluation measure to be
optimized (say, maximized), defined as
$J : \mathcal{P}(X) \to \mathbb{R}^+ \cup \{0\}$. This function
accounts for a general evaluation measure, which may or may not be
inspired by a precise and previous definition of relevance. Let
$c(x) \geq 0$ represent the cost of variable $x$ (measurement cost,
needed technical skill, etc.) and call
$c(X') = \sum_{x \in X'} c(x)$ for $X' \in \mathcal{P}(X)$. Let
$C_X = c(X)$ be the cost of the whole feature set. It is assumed here
that $c$ is additive, that is, $c(X' \cup X'') = c(X') + c(X'')$ for
disjoint $X'$ and $X''$ (together with non-negativeness, this implies
that $c$ is monotone).

Definition 1 (Feature Subset Selection) The selection of an optimal
feature subset ("Feature Subset Selection") is either of two
scenarios:

(1) Fix $C_0 \leq C_X$. Find the $X' \in \mathcal{P}(X)$ of maximum
$J(X')$ among those that fulfill $c(X') \leq C_0$.

(2) Fix $J_0 \geq 0$. Find the $X' \in \mathcal{P}(X)$ of minimum
$c(X')$ among those that fulfill $J(X') \geq J_0$.

If the costs are unknown, a meaningful choice is obtained by setting
$c(x) = 1$ for all $x \in X$. Then $c(X') = |X'|$ and $c$ can be
interpreted as the complexity of the solution. In this case, (1)
amounts to finding the subset with highest $J$ among those having a
maximum pre-specified size. In scenario (2), it amounts to finding the
smallest subset among those having a minimum pre-specified performance
(as measured by $J$). With only these restrictions, an optimal subset
of features need not exist; if it does, it is not necessarily unique.
In scenario (1), a solution always exists by defining
$c(\emptyset) = 0$ and $J(\emptyset)$ to be the value of $J$ with no
features. In case (2), if there is no solution, an adequate policy may
be to set $J_0 = \epsilon J(X)$, $\epsilon > 0$, and progressively
lower the value of $\epsilon$. If there is more than one solution (of
equal performance and cost, by definition) one is usually interested
in them all. We shall speak of a FSA of Type 1 (resp. Type 2) when it
has been designed to solve the first (resp. second) scenario in
Def. 1. If the FSA can be used in both scenarios, we shall speak of a
general-type algorithm. In addition, if one has no control whatsoever,
we shall speak of a free-type algorithm.
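
As an illustration of the two scenarios under unit costs, the following minimal Python sketch enumerates all subsets exhaustively. It is only practical for a handful of features; the function names and the toy measure `toy_J` are hypothetical and are not part of the algorithms reviewed later.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every subset of `features` as a frozenset, smallest first."""
    items = sorted(features)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def type1(features, J, c0):
    """Scenario (1): maximize J among subsets with |X'| <= c0."""
    candidates = [s for s in all_subsets(features) if len(s) <= c0]
    best = max(J(s) for s in candidates)
    return [s for s in candidates if J(s) == best]

def type2(features, J, j0):
    """Scenario (2): minimize |X'| among subsets with J(X') >= j0."""
    candidates = [s for s in all_subsets(features) if J(s) >= j0]
    if not candidates:
        return []  # no subset reaches the requested performance
    smallest = min(len(s) for s in candidates)
    return [s for s in candidates if len(s) == smallest]

# Hypothetical toy measure: only features "a" and "b" carry information.
def toy_J(subset):
    return len(subset & {"a", "b"}) / 2

X = {"a", "b", "c"}
print(type1(X, toy_J, c0=1))    # the best subsets of size at most 1
print(type2(X, toy_J, j0=1.0))  # the smallest subsets reaching J >= 1
```

When scenario (2) returns an empty list, the policy described above (lowering $J_0$ progressively) can be applied by calling `type2` again with a smaller `j0`.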

We shall use the notation $S(X')$ to indicate the subsample of $S$
described by the features in $X' \subseteq X$ only.



4 FEATURE SUBSET SELECTION ALGORITHMS 



The relationship between a FSA and the inductive learning 
method used to infer a model can take three main forms: 
filter, wrapper or embedded, which we call the mode: 

Embedded Mode: The inducer has its own FSA (either explicit or
implicit). The methods to induce logical conjunctions (Vere, 1975;
Winston, 1975), decision trees or artificial neural networks are
examples of this embedding.

Filter Mode: If feature selection takes place before the 
induction step, the former can be seen as a filter (of non- 
useful features). In a general sense it can be seen as a 
particular case of the embedded mode in which feature 
selection is used as a pre-processing. The filter mode is 
then independent of the inducer that evaluates the model 
after the feature selection process. 

Wrapper Mode: Here the relationship is taken the other way around: the
FSA uses the learning algorithm as a subroutine (John et al., 1994).
The argument in favor of this mode is to match the bias of the FSA to
that of the inducer that will be used later to assess the goodness of
the model. A main disadvantage is the computational burden that comes
from calling the inducer to evaluate each and every subset of
considered features.

In what follows several of the currently most widespread FSAs in
machine learning are described and briefly commented on.
General-purpose search algorithms, such as genetic algorithms, are
excluded from the review. None of the algorithms allows the
specification of costs in the features.
Most of them can work in filter or wrapper mode. The fea- 
ture weighing algorithm Relief has been included, both in 
the review and in the experimental comparison, as a com- 
plement. This is so because it can also be used to select a 
subset of features, although a way of getting a subset out of 
the weights has to be devised. In the following we assume 
again that the evaluation measure J is to be maximized. 

4.1 Algorithm LVF 



LvF (Las Vegas Filter) (Liu and Setiono, 1998a) is a type 2 algorithm
that repeatedly generates random subsets and computes the consistency
of the sample: an inconsistency in $X'$ and $S$ is defined as two
instances in $S$ that are equal when considering only the features in
$X'$ and that belong to different classes. The aim is to find the
minimum subset of features leading to zero inconsistencies. The
inconsistency count of an instance $A \in S$ is defined as:

$$IC_{X'}(A) = X'(A) - \max_k X'_k(A) \qquad (1)$$

where $X'(A)$ is the number of instances in $S$ equal to $A$ using
only the features in $X'$ and $X'_k(A)$ is the number of instances in
$S$ of class $k$ equal to $A$ using only the features in $X'$ (Liu and
Setiono, 1998b). The inconsistency rate of a feature subset in a
sample $S$ is then:

$$IR(X') = \frac{\sum_{A \in S} IC_{X'}(A)}{|S|} \qquad (2)$$

This is a monotonic measure, in the sense

$$X_1 \subset X_2 \Rightarrow IR(X_1) \geq IR(X_2)$$

The evaluation measure is then $J(X') = \frac{1}{IR(X') + 1} \in (0, 1]$,
which can be evaluated in $O(|S|)$ time using a hash table.
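
The hash-table evaluation can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that each distinct pattern of values on the chosen features is counted once (the usual convention for the inconsistency rate), and the function names and the XOR toy sample are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency_rate(sample, labels, subset):
    """Inconsistency rate of `subset` (feature indices) on `sample`."""
    groups = defaultdict(Counter)  # pattern -> class counts (the hash table)
    for row, label in zip(sample, labels):
        pattern = tuple(row[i] for i in subset)
        groups[pattern][label] += 1
    # Assumption: each distinct pattern contributes
    # (number of matching instances) - (size of its majority class).
    count = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return count / len(sample)

def J(sample, labels, subset):
    """Evaluation measure J = 1 / (IR + 1), which lies in (0, 1]."""
    return 1.0 / (inconsistency_rate(sample, labels, subset) + 1.0)

# Hypothetical toy sample: the class is the XOR of the two features.
sample = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]
print(inconsistency_rate(sample, labels, (0, 1)))  # both features: consistent
print(inconsistency_rate(sample, labels, (0,)))    # one feature: inconsistent
```

A single pass over the sample fills the table, which is where the linear cost in $|S|$ comes from.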

LvF is described as Algorithm 1. It has been found to be particularly
efficient for data sets having redundant features (Dash et al., 1997).
Arguably its main advantage may be that it quickly reduces the number
of features in the initial stages with certain confidence (Dash and
Liu, 1998; Dash et al., 2000); however, many poor solution subsets
are analyzed, wasting computing resources.
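
A minimal Python sketch of the LvF loop follows. It mirrors the pseudocode of Algorithm 1 under stated assumptions: the measure `J` is passed as a function on feature subsets, the random subset is drawn from the current best set with a uniformly chosen non-empty size, the full set is kept as the initial solution and duplicate solutions are not stored twice. The toy measure is hypothetical.

```python
import random

def lvf(features, J, max_iter, rng=None):
    """Las Vegas Filter sketch: J maps a frozenset of features to a score."""
    rng = rng or random.Random(0)
    best = frozenset(features)
    j0 = J(best)                  # minimum allowed value of J
    solutions = [best]
    for _ in range(max_iter):
        pool = sorted(best)
        size = rng.randint(1, len(pool))
        candidate = frozenset(rng.sample(pool, size))
        if J(candidate) >= j0:
            if len(candidate) < len(best):
                best = candidate
                solutions = [candidate]   # reinitialize on strict improvement
            elif len(candidate) == len(best) and candidate not in solutions:
                solutions.append(candidate)
    return solutions

# Hypothetical toy measure: J is maximal only when "a" and "b" are both kept.
def toy_J(subset):
    return 1.0 if {"a", "b"} <= subset else 0.5

result = lvf({"a", "b", "c", "d"}, toy_J, max_iter=500)
print(result)
```

Most random candidates are rejected, which reflects the waste of computing resources mentioned above.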

4.2 Algorithm LVI 

Lvi (Las Vegas Incremental) is also a type 2 algorithm and an
evolution of LvF. It is based on the grounds that it is not necessary
to use the whole sample $S$ in order to


Input:
    max - the maximum number of iterations
    J - evaluation measure
    S(X) - a sample S described by X, |X| = n
Output:
    L - all equivalent solutions found

L := []                          // L stores equally good sets
Best := X                        // Initialize best solution
J0 := J(S(X))                    // Minimum allowed value of J
repeat max times
    X' := Random_SubSet(Best)
    if J(S(X')) ≥ J0 then
        if |X'| < |Best| then
            Best := X'
            L := [X']            // L is reinitialized
        else
            if |X'| = |Best| then
                L := append(L, X')
            end
        end
    end
end

Algorithm 1: LvF (Las Vegas Filter).



Input:
    max - the maximum number of iterations
    J - evaluation measure
    S(X) - a sample S described by X, |X| = n
    p - initial percentage
Output:
    X' - solution found

S0 := portion(S, p)              // Initial portion
S1 := S \ S0
J0 := J(S(X))                    // Minimum allowed value of J
repeat forever
    X' := LVF(max, J, S0(X))
    if J(S1(X')) ≥ J0 then stop
    else
        C := {x ∈ S1(X') making S1(X') inconsistent}
        S0 := S0 ∪ C
        S1 := S1 \ C
    end
end

Algorithm 2: Lvi (Las Vegas Incremental).



evaluate the measure $J$, which for this algorithm is again
consistency (Liu and Setiono, 1998b). The algorithm is described as
Algorithm 2. It departs from a portion $S_0$ of $S$; if LvF finds a
sufficiently good solution in $S_0$ then Lvi halts. Otherwise the set
of instances in $S \setminus S_0$ making $S_1$ inconsistent is added
to $S_0$, this new portion is handed over to LvF and the process is
iterated. Intuitively, the portion cannot be too small or too big. If
it is too small, after the first iteration many inconsistencies will
be found and added to the current portion, which will hence be very
similar to $S$. If it is too big, the computational savings will be
modest. The authors suggest $p = 10\%$ or a value proportional to the
number of features. In Liu and Motoda (1998) it is reported
experimentally that Lvi adequately chooses relevant features, but may
fail for noisy data sets, in which case the algorithm is shown to
consider irrelevant features. Probably Lvi is more sensitive to noise
than LvF in cases of small sample sizes.

Input:
    p - sampling percentage
    d - distance measure
    S(X) - a sample S described by X, |X| = n
Output:
    w - array of feature weights

let m := p|S|
initialize array w[] to zero
do m times
    I := Random_Instance(S)
    I_nh := Near_Hit(I, S)
    I_nm := Near_Miss(I, S)
    for each i ∈ [1..n] do
        w[i] := w[i] + d_i(I, I_nm)/m - d_i(I, I_nh)/m
    end
end

Algorithm 3: Relief.



4.3 Algorithm RELIEF 

Relief (Kira and Rendell, 1992) is a general-type algorithm that works
exclusively in filter mode. The algorithm randomly chooses an instance
$I \in S$ and finds its near hit and its near miss. The former is the
closest instance to $I$ among all the instances in the same class as
$I$. The latter is the closest instance to $I$ among all the instances
in a different class. The underlying idea is that a feature is more
relevant to $I$ the more it separates $I$ and its near miss, and the
less it separates $I$ and its near hit. The result is a weighed
version of the original feature set. The algorithm for two classes is
described as Algorithm 3.
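
The weight update can be illustrated with a minimal Python sketch for two classes and discrete features. It assumes a 0/1 mismatch as the per-feature distance, the Hamming distance to find the near hit and near miss, and at least two instances per class; these choices and the toy sample are assumptions made for the example.

```python
import random

def relief(sample, labels, m, rng=None):
    """Two-class Relief on discrete features; returns one weight per feature."""
    rng = rng or random.Random(0)
    n = len(sample[0])

    def diff(i, a, b):            # d_i: 0/1 mismatch on feature i
        return 0.0 if a[i] == b[i] else 1.0

    def dist(a, b):               # Hamming distance over all features
        return sum(diff(i, a, b) for i in range(n))

    w = [0.0] * n
    for _ in range(m):
        idx = rng.randrange(len(sample))
        inst, cls = sample[idx], labels[idx]
        hits = [s for j, s in enumerate(sample) if labels[j] == cls and j != idx]
        misses = [s for j, s in enumerate(sample) if labels[j] != cls]
        near_hit = min(hits, key=lambda s: dist(inst, s))
        near_miss = min(misses, key=lambda s: dist(inst, s))
        for i in range(n):
            w[i] += diff(i, inst, near_miss) / m - diff(i, inst, near_hit) / m
    return w

# Hypothetical toy sample: feature 0 determines the class, feature 1 is noise.
sample = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
weights = relief(sample, labels, m=8)
print(weights)
```

In this toy sample the relevant feature always separates an instance from its near miss and never from its near hit, so it receives the larger weight.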

When costs are just sizes, the algorithm can be used to simulate a
type 1 scenario by iteratively checking the sequence of the first
$C_0$ nested subsets in the order given by decreasing weights, calling
the $J$ measure, and returning the subset with the highest value of
$J$. To simulate a type 2 scenario, the same sequence is checked
looking for the first element in the sequence that yields a value of
$J$ not less than the chosen $J_0$. The most important advantage of
Relief is the rapid assessment of irrelevant features with a
principled approach; however, it does not discriminate well among
redundant features. The algorithm has been found to choose correlated
features instead of relevant features (Dash et al., 1997), and
therefore the optimal subset can be far from assured (Kira and
Rendell, 1992). Some variants have been proposed to account for
several classes (Kononenko, 1994), where the $k$ most similar
instances are selected and their averages computed.



4.4 Algorithms SFG/SBG 

These two are classical general-type algorithms that may work in
filter or wrapper mode. Sfg (Sequential Forward Generation)
iteratively adds features to an initial subset, trying to improve a
measure $J$, always taking into account those features already
selected. Consequently, an ordered list can also be obtained. Sbg
(Sequential Backward Generation) is the backward counterpart. They are
jointly described as Algorithm 4. When the number of features is
small, Doak (1992) reported that Sbg tends to show better performance
than Sfg, most likely because Sbg evaluates the contribution of all
features from the onset. In addition, Aha and Bankert (1995) point out
that Sfg is preferable when the number of relevant features is (known
to be) small; otherwise Sbg should be used. Interestingly, it was also
reported that Sbg did not always have better performance than Sfg,
contrary to the conclusions in Doak (1992). Besides, Sfg is faster in
practice. The algorithms W-Sfg and W-Sbg (W for wrapper) use the
accuracy of an inducer as evaluation measure.



Input:
    S(X) - a sample S described by X, |X| = n
    J - evaluation measure
Output:
    X' - solution found

X' := ∅ /* Forward */ or X' := X /* Backward */
repeat
    x' := argmax{J(S(X' ∪ {x})) | x ∈ X \ X'}    /* Forward */
    x' := argmax{J(S(X' \ {x})) | x ∈ X'}        /* Backward */
    X' := X' ∪ {x'}                               /* Forward */
    X' := X' \ {x'}                               /* Backward */
until no improvement in J in last j steps
    or X' = X /* Forward */ or X' = ∅ /* Backward */

Algorithm 4: Sbg/Sfg (Sequential Backward/Forward Generation).
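
Both directions of Algorithm 4 can be sketched in a few lines of Python. The sketch is illustrative only: it assumes that `J` is a function on feature subsets, that the forward search stops at the first non-improving step (that is, $j = 1$), and that the backward search keeps removing features while `J` does not decrease. The toy measure is hypothetical.

```python
def sfg(features, J):
    """Sequential Forward Generation: greedily add the best feature."""
    selected, remaining = [], set(features)
    best_score = J(frozenset())
    while remaining:
        # Sort candidates so ties are broken deterministically.
        x = max(sorted(remaining), key=lambda f: J(frozenset(selected + [f])))
        score = J(frozenset(selected + [x]))
        if score <= best_score:       # stop at the first non-improving step
            break
        selected.append(x)
        remaining.remove(x)
        best_score = score
    return selected                   # ordered by inclusion

def sbg(features, J):
    """Sequential Backward Generation: greedily drop the least useful feature."""
    current = set(features)
    best_score = J(frozenset(current))
    while current:
        x = max(sorted(current), key=lambda f: J(frozenset(current - {f})))
        score = J(frozenset(current - {x}))
        if score < best_score:        # every possible removal hurts J
            break
        current.remove(x)
        best_score = score
    return current

# Hypothetical toy measure: "a" and "b" are useful, each feature has a small cost.
def toy_J(subset):
    return len(subset & {"a", "b"}) - 0.01 * len(subset)

X = {"a", "b", "c"}
print(sfg(X, toy_J))
print(sbg(X, toy_J))
```

The list returned by `sfg` is ordered by inclusion, which corresponds to the ordered list mentioned in the description of Sfg.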



4.5 Algorithms SFFG/SFBG 

These are free-type algorithms that may work in filter or wrapper
mode. Sffg (Sequential Floating Forward Generation) (Pudil et al.,
1994) is an exponential cost algorithm that operates in a sequential
fashion, performing a forward step followed by a variable (and
possibly null) number of backward ones. In essence, a feature is first
unconditionally added and then features are removed as long as the
generated subsets are the best among their respective size. The
algorithm (described in Algorithm 5 as a flow-chart) is so-called
because it has the characteristic of floating around a potentially
good solution of the specified size. The backward counterpart Sfbg
performs a backward step followed by zero or more forward steps. These
two algorithms have been found to be very effective in some situations
(Jain and Zongker, 1997), and are among the most popular nowadays.
Their main drawbacks are the computational cost, which may be
unaffordable when the number


Input:
    S(X) - a sample S described by X, |X| = n
    J - evaluation measure
    d - desired size of the solution
    Δ - maximum deviation allowed with respect to d
Output:
    solution of size d ± Δ

[Flow-chart. Recoverable steps: conditionally exclude a feature found
by applying a step of SBG using S(X_k), J; test "Is this the best
subset of k-1 features found so far?"; put the excluded feature back.]

Algorithm 5: Sffg (Sequential Floating Forward Generation). The set
X_k denotes the current solution (of size k); S(X_k) is the sample
described by the features in X_k only.



of features nears one hundred (Bins and Draper, 2001), and the need to
fix the size of the final desired subset.

4.6 Algorithm QBB

The Qbb (Quick Branch and Bound) algorithm (Dash and Liu, 1998)
(described as Algorithm 7) is a type 1 algorithm. Actually it is a
hybrid one, composed of LvF and Abb. The origin of Abb is in Branch &
Bound (Narendra and Fukunaga, 1977), an optimal search algorithm.
Given a threshold $\beta$ (specified by the user), the search stops at
each node the evaluation of which is lower than $\beta$, so that
efferent branches are pruned. Abb (Automatic Branch & Bound) (Liu et
al., 1998) is a variant having its bound as the inconsistency rate of
the data when the full set of features is used (Algorithm 6). The
basic idea of Qbb consists in using LvF to find good starting points
for Abb. It is expected that Abb can explore the remaining search
space efficiently. The authors reported that Qbb is, in general, more
efficient than LvF or Abb in terms of average cost of execution and
selected relevant features.
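
The pruning idea behind Abb can be sketched in Python as follows. This is an iterative, breadth-first approximation of the recursive procedure of Algorithm 6, not the authors' implementation: it assumes a monotonic measure `J` given as a function on feature subsets, and it uses the value of `J` on the full feature set as the bound. The toy measure is hypothetical.

```python
from collections import deque

def abb(features, J):
    """Automatic Branch and Bound sketch for a monotonic measure J."""
    full = frozenset(features)
    j0 = J(full)                      # bound: value of J on the full set
    solutions = [full]
    seen = {full}
    pruned = []                       # states that fell below the bound
    queue = deque([full])
    while queue:
        state = queue.popleft()
        for x in sorted(state):
            child = state - {x}       # remove one feature at a time
            if child in seen:
                continue
            seen.add(child)
            # A subset of a pruned state cannot recover under a monotonic J.
            if any(child <= p for p in pruned):
                continue
            if J(child) >= j0:
                solutions.append(child)
                queue.append(child)
            else:
                pruned.append(child)
    k = min(len(s) for s in solutions)
    return [s for s in solutions if len(s) == k]

# Hypothetical monotonic toy measure: "a" and "b" are jointly required.
def toy_J(subset):
    return 1.0 if {"a", "b"} <= subset else 0.5

result = abb({"a", "b", "c", "d"}, toy_J)
print(result)
```

In the spirit of Qbb, the same function could be started from each subset returned by LvF instead of from the full feature set, so that Abb explores a much smaller search space.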



5 EMPIRICAL EVALUATION OF FSAs 

The main question arising in a feature selection experimen- 
tal design is: what are the aspects that we would like to 
evaluate of a FSA solution in a given data set? Certainly a 
good algorithm is one that maintains a well-balanced trade- 
off between small-sized and competitive solutions. To as- 
sess these two issues at the same time is a difficult un- 



Input:
    S(X) - a sample S described by X, |X| = n
    J - evaluation measure (monotonic)
Output:
    L - all equivalent solutions found

procedure ABB(S(X): sample; var L': list of set)
    for each x in X do
        enqueue(Q, X \ {x})      // remove a feature at a time
    end
    while not empty(Q) do
        X' := dequeue(Q)
        // X' is legitimate if it is not a subset of a pruned state
        if legitimate(X') and J(S(X')) ≥ J0 then
            L' := append(L', X')
            ABB(S(X'), L')
        end
    end
end

begin
    Q := ∅                       // Queue of pending states
    L' := [X]                    // List of solutions
    J0 := J(S(X))                // Minimum allowed value of J
    ABB(S(X), L')                // Initial call to ABB
    k := smallest size of a subset in L'
    L := set of elements of L' of size k
end

Algorithm 6: Abb (Automatic Branch and Bound).



Input:
    max - the maximum number of iterations
    J - monotonic evaluation measure
    S(X) - a sample S described by X, |X| = n
Output:
    L - all equivalent solutions found

L_ABB := []
L_LVF := LVF(max, J, S(X))
for each X' ∈ L_LVF do
    L_ABB := concat(L_ABB, ABB(S(X'), J))
end
k := smallest size of a subset in L_ABB
L := set of elements of L_ABB of size k

Algorithm 7: Qbb (Quick Branch and Bound).



dertaking in practice, given that their optimal relationship 
is user-dependent. In the present controlled experimental 
scenario, the task is greatly eased since the size and perfor- 
mance of the optimal solution is known in advance. The 
aim of the experiments is precisely to contrast the abil- 
ity of the different FSAs to hit a solution with respect to 
relevance, irrelevance, redundancy and sample size. 

Relevance: Different families of problems are generated by varying the
number of relevant features $N_R$. These are features that will have
an influence on the output and whose role cannot be assumed by any
other subset.

Irrelevance: Irrelevant features are defined as those not having any
influence on the output. Their values are generated at random for each
example. For a problem with $N_R$ relevant features, different numbers
of irrelevant features $N_I$ are added to the corresponding data sets
(thus providing several subproblems for each choice of $N_R$).

Redundancy: In this work, a redundancy exists when a feature can take
the role of another. Following a parsimony principle, we are
interested in the behaviour of the algorithms when faced with this
simplest case. If an algorithm fails to identify redundancy in this
situation (something that is actually found in the experiments
reported below), then this is interesting and something we should be
aware of. This effect is obtained by choosing a relevant feature
randomly and replicating it in the data set. For a problem with $N_R$
relevant features, different numbers of redundant features $N_{R'}$
are added in a way analogous to the generation of irrelevant features.

Sample Size: number of instances $|S|$ of a data sample $S$. In these
experiments, $|S| = \alpha k N_T c$, where $\alpha$ is a constant, $k$
is a multiplying factor, $N_T$ is the total number of features
($N_R + N_I + N_{R'}$) and $c$ is the number of classes of the
problem. This means that the sample size will depend linearly on the
total number of features.
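
The construction of such a sample can be sketched in Python as follows. The sketch assumes binary features, a parity function as the target concept, and illustrative values for $\alpha$, $k$ and the feature counts; none of these values is taken from the experiments, and the function names are hypothetical.

```python
import random

def make_dataset(f, n_rel, n_irr, n_red, size, rng=None):
    """Binary sample: relevant features, then irrelevant, then redundant copies."""
    rng = rng or random.Random(0)
    # Each redundant column replicates one randomly chosen relevant column.
    sources = [rng.randrange(n_rel) for _ in range(n_red)]
    rows, labels = [], []
    for _ in range(size):
        relevant = [rng.randint(0, 1) for _ in range(n_rel)]
        irrelevant = [rng.randint(0, 1) for _ in range(n_irr)]
        redundant = [relevant[s] for s in sources]
        rows.append(tuple(relevant + irrelevant + redundant))
        labels.append(f(relevant))    # the class depends on relevant features only
    return rows, labels, sources

def parity(bits):
    return sum(bits) % 2

alpha, k, classes = 20, 1, 2          # hypothetical values for the sketch
n_rel, n_irr, n_red = 3, 2, 1
size = alpha * k * (n_rel + n_irr + n_red) * classes
rows, labels, sources = make_dataset(parity, n_rel, n_irr, n_red, size)
print(size, rows[0], labels[0])
```

Because the generator knows which columns are relevant, irrelevant or replicated, the set of optimal solutions is available for scoring the output of any FSA.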

5.1 Evaluation of performance 

We derive in this section a scoring measure to capture the degree to
which a solution obtained by a FSA matches (one of) the correct
solution(s). This criterion behaves as a similarity
$s : \mathcal{P}(X) \times \mathcal{P}(X) \to [0, 1]$ between subsets
of $X$ in the data analysis sense (Chandon and Pinson, 1981), where
$s(X_1, X_2) > s(X_1, X_3)$ indicates that $X_2$ is more similar to
$X_1$ than $X_3$, and satisfying $s(X_1, X_2) = 1 \iff X_1 = X_2$ and
$s(X_1, X_2) = s(X_2, X_1)$. Let us denote by $X$ the total set of
features, partitioned in $X = X_R \cup X_I \cup X_{R'}$, with
$X_R, X_I, X_{R'}$ being the subsets of relevant, irrelevant and
redundant features of $X$, respectively, and call $X^* \subseteq X$
any of the correct solutions (all and only relevant variables, no
redundancy). Let us denote by $A$ the feature subset selected by a
FSA. The idea is to check how much $A$ and $X^*$ have in common.

Let us define $A_R = X_R \cap A$, $A_I = X_I \cap A$ and
$A_{R'} = X_{R'} \cap A$. In general, we have $A_T = X_T \cap A$
(hereafter $T$ stands for a subindex in $\{R, I, R'\}$). Since
necessarily $A \subseteq X$, we have that
$A = A_R \cup A_I \cup A_{R'}$ is a partition of $A$. The score
$S_X(A) : \mathcal{P}(X) \to [0, 1]$ is defined in terms of the
similarity in that for all $A \subseteq X$, $S_X(A) = s(A, X^*)$.
Thus, $S_X(A) > S_X(A')$ indicates that $A$ is more similar to $X^*$
than $A'$. The idea is to make a flexible measure, so that it can
weigh each type of divergence (in relevance, irrelevance and
redundancy) with respect to the correct solution. To this end, a set
of parameters is collected as
$\alpha = \{\alpha_R, \alpha_I, \alpha_{R'}\}$ with $\alpha_T \geq 0$
and $\sum_T \alpha_T = 1$.

Intuitive Description. The criterion $S_X(A)$ penalizes three
situations: (1) There are relevant features lacking in $A$ (the
solution is incomplete), (2) There are more than enough relevant
features in $A$ (the solution is redundant) and (3) There are some
irrelevant features in $A$ (the solution is incorrect).

An order of importance and a weight will be assigned (via the
$\alpha_T$ parameters) to each of these situations. The preceding
point (3) is simple to model: it suffices to check whether
$|A_I| > 0$, $A$ being the solution of the FSA. Relevance and
redundancy are strongly related, given that a feature is redundant or
not depending on what other relevant features are present in $A$.
Notice then that the correct solution $X^*$ is not unique, and all of
them should be equally valid. To this end, the features are broken
down into equivalence classes, where elements of the same class are
redundant to each other (i.e., any correct solution must comprise only
one feature of each equivalence class). $A$ being a feature set, we
define a binary relation $\sim$ between two features $x_i, x_j \in A$
as: $x_i \sim x_j$ if and only if $x_i$ and $x_j$ represent the same
information. Clearly $\sim$ is an equivalence relation. Let
$A/\!\sim$ be the quotient set of $A$ under $\sim$; any correct
solution must be of the same size as $X_R$ and have one element in
every subset of $(X_R \cup X_{R'})/\!\sim$.

Construction of the score. The set to be split into equivalence
classes is formed by all the relevant features (redundant or not)
chosen by a FSA. Define $p_A = (A_R \cup A_{R'})/\!\sim$ (equivalence
classes in which the relevant and redundant features chosen by a FSA
are split), $p_X = (X_R \cup X_{R'})/\!\sim$ (same with respect to the
original set of features) and
$p_{A \subseteq X} = \{x \in p_X \mid \exists y \in p_A, y \subseteq x\}$.
For $Q$ a quotient set, let:

$$F(Q) = \sum_{x \in Q} (|x| - 1)$$

The idea is to express the quotient between the number of redundant
features chosen by the FSA and the number it could have chosen, given
the relevant features present in its solution. In the preceding
notation, this is written (provided the denominator is not null):

$$\frac{F(p_A)}{F(p_{A \subseteq X})}$$
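
A small worked example of this quotient, written as a Python sketch. It assumes that $F(Q)$ counts, for every equivalence class in $Q$, the members beyond the first one (the redundant ones); the representation of classes as frozensets and the feature names are hypothetical.

```python
def F(quotient):
    """Number of features beyond one representative per class (assumed form of F)."""
    return sum(len(cls) - 1 for cls in quotient)

def redundancy_ratio(p_X, A):
    """Return F(p_A) / F(p_{A in X}), or None when the denominator is zero."""
    p_A = [cls & A for cls in p_X if cls & A]       # chosen part of each class
    p_A_in_X = [cls for cls in p_X if cls & A]      # full classes touched by A
    denom = F(p_A_in_X)
    return None if denom == 0 else F(p_A) / denom

# Hypothetical equivalence classes of relevant and redundant features.
p_X = [
    frozenset({"x1", "x1_copy"}),
    frozenset({"x2"}),
    frozenset({"x3", "x3_copy_a", "x3_copy_b"}),
]
# Hypothetical FSA output: one redundant pair, one clean pick, one irrelevant "i1".
A = frozenset({"x1", "x1_copy", "x3", "i1"})
print(redundancy_ratio(p_X, A))
```

Here the solution contains one redundant feature out of the three it could have picked from the classes it touches, so the quotient is one third; the irrelevant feature plays no role in this term.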

Let us finally build the score, formed by three terms: 
relevance, irrelevance and redundancy. Defining: 

\xiy \XRy 

, [ if F(p^cx) =0 

" I " f£c1) ) otherwise. 

for any $A \subseteq X$ the score is defined as
$S_X(A) = s(A, X^*) = \alpha_R R_A + \alpha_{R'} R'_A + \alpha_I I_A$.
This score fulfills the two conditions (proof is given in the
Appendix):

1. $S_X(A) = 0 \iff A = X_I$

2. $S_X(A) = 1 \iff A = X^*$



We can establish now the desired restrictions on the be- 
havior of the score. From the more to the less severe: there 
are relevant features lacking, there are irrelevant features, 
and there is redundancy in the solution. This is reflected 
in the following conditions on the ar'- 

1. Choosing an irrelevant feature is better than missing 
a relevant one: > -j^^ 

2. Choosing a redundant feature is better than choosing 

an irrelevant one: > , 

\Xi\ \x,i,\ 

We also define $\alpha_T = 0$ if $|X_T| = 0$. Observe that the
denominators are important for, say, expressing the fact that
choosing an irrelevant feature when there were only two is not the
same as choosing one when there were three (in the latter case, there
is an irrelevant feature that could have been chosen but was not).
In order to translate the previous inequalities into workable
conditions, a parameter $\epsilon \in (0, 1]$ is introduced to express
the precise relation between the $\alpha_T$. Let
$\bar{\alpha}_T = \alpha_T / |X_T|$. The following equations have to
be satisfied, together with $\alpha_R + \alpha_I + \alpha_{R'} = 1$:

$$\beta_R \bar{\alpha}_R = \bar{\alpha}_I, \qquad \beta_I \bar{\alpha}_I = \bar{\alpha}_{R'},$$

for suitably chosen values of $\beta_R$ and $\beta_I$. Reasonable
settings are obtained by taking $\beta_R = \epsilon/2$ and
$\beta_I = 2\epsilon/3$, though other settings are possible, depending
on the evaluator's needs. With these values, at equal $|X_R|$, $|X_I|$,
$|X_{R'}|$, $\alpha_R$ is at least twice as important as $\alpha_I$
(because of the $\epsilon/2$) and $\alpha_I$ is at least one and a half
times as important as $\alpha_{R'}$. Specifically, the minimum values
are attained for $\epsilon = 1$ (i.e., $\alpha_R$ counts twice
$\alpha_I$). For $\epsilon < 1$ the differences widen proportionally to
the point that, for $\epsilon \approx 0$, only $\alpha_R$ will
practically count in the overall score.
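As a worked illustration of how the three weights follow from these constraints, the sketch below solves the two proportionality equations together with the normalization to 1. It assumes that the equations relate the per-feature weights (each weight divided by the size of its feature family) and uses the suggested settings for the two beta factors; the function name `score_weights` and its parameter names are hypothetical.

```python
def score_weights(n_rel: int, n_irr: int, n_red: int, eps: float = 1.0):
    """Return (alpha_R, alpha_I, alpha_R') for the given feature counts."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    if n_rel <= 0:
        raise ValueError("at least one relevant feature is assumed")
    beta_r = eps / 2.0        # suggested setting for beta_R
    beta_i = 2.0 * eps / 3.0  # suggested setting for beta_I
    # ASSUMPTION: the constraints act on per-feature weights abar_T = alpha_T / |X_T|:
    #   abar_I = beta_R * abar_R   and   abar_R' = beta_I * abar_I,
    # with |X_R| abar_R + |X_I| abar_I + |X_R'| abar_R' = 1.
    abar_r = 1.0 / (n_rel + beta_r * n_irr + beta_r * beta_i * n_red)
    abar_i = beta_r * abar_r
    abar_red = beta_i * abar_i
    # An empty feature family receives weight 0 because its count is 0.
    return n_rel * abar_r, n_irr * abar_i, n_red * abar_red
```

With four features of each kind and eps = 1 this gives weights 6/11, 3/11 and 2/11, which reproduces the "twice" and "one and a half times" relations stated in the text.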



6 EXPERIMENTAL EVALUATION 



In the following sections we detail the experimental
methodology and quantify the various parameters of the
experiments. The basic idea consists in generating sample
data sets using synthetic functions $f$ with known relevant
features. These data sets (of different sizes) are corrupted
with irrelevant and/or redundant features and handed over
to the different FSAs to obtain a hypothesis $H$. The
divergence between the defined function $f$ and the obtained
hypothesis $H$ is evaluated by the score criterion (with
$\epsilon = 1$). This experimental design is illustrated in Fig. 1.

6.1 Description of the FSAs used 

Up to ten FSAs were used in the experiments. These are
E-Sfg, Qbb, Lvf, Lvi, C-Sbg, Relief, Sfbg, Sffg, W-Sbg,
and W-Sfg. The algorithms E-Sfg and W-Sfg are versions of
Sfg using entropy and the accuracy of a C4.5 inducer,
respectively. The algorithms C-Sbg and W-Sbg are versions of
Sbg using consistency and the accuracy of a C4.5 inducer,
respectively.

Figure 1: Flow-Chart of Experimental Design.

Since Relief and E-Sfg yield an ordered list of features $x_i$
according to their weight $w_i$, an automatic filtering criterion
is necessary to transform every solution into a subset of features.
The procedure used here to determine a suitable cut point is simple:
first the weights are sorted in decreasing order (with $w_n$ the
greatest weight, corresponding to the most relevant feature).
Then those weights further than two variances from the mean are
discarded (that is to say, those with very high or very low
weights). The idea is to look for the feature $x_j$
such that — — is maximum. Intuitivelv, this corre- 

—wi n ' 

sponds to obtaining the maximum weight with the lowest 
number of features. The cut point is then set between $x_j$
and $x_{j-1}$.



6.2 Implementations of data families 

A total of twelve families of data sets were generated by studying
three different problems and four instances of each, varying the
number of relevant features $N_R$. Let $x_1, \ldots, x_n$ be the
relevant features of a problem $f$.

Parity: This is the classic problem where the output is
$f(x_1, \ldots, x_n) = 1$ if the number of $x_i = 1$ is odd and
$f(x_1, \ldots, x_n) = 0$ otherwise.

Disjunction: Here we have $f(x_1, \ldots, x_n) = 1$ if
$(x_1 \wedge \cdots \wedge x_{n'}) \vee (x_{n'+1} \wedge \cdots \wedge x_n)$
holds, with $n' = n \text{ div } 2$ ($n$ even) and
$n' = (n \text{ div } 2) + 1$ ($n$ odd).

GMonks: This problem is a generalization of the classic
monks problems (Thrun, 1991). In its original version,
three independent problems were applied on sets of $n = 6$
features that take values of a discrete, finite and unordered
set (nominal features). Here we have grouped the three
problems in a single one computed on each chunk of 6
features. Let $n$ be a multiple of 6, $k = n \text{ div } 6$ and
$b = 6(k' - 1) + 1$ for $1 \le k' \le k$. Let us denote by "1" the
first value of a feature, by "2" the second, etc. The problems
are the following:



1. P1: $(x_b = x_{b+1}) \vee (x_{b+4} = 1)$

2. P2: two or more $x_i = 1$ in $x_b \ldots x_{b+5}$

3. P3: $(x_{b+4} = 3 \wedge x_{b+3} = 1) \vee (x_{b+4} \neq 3 \wedge x_{b+1} \neq 2)$

For each chunk, the boolean condition $P2 \wedge \neg(P1 \wedge P3)$
is checked. If it is satisfied for $n_c \text{ div } 2$ or more chunks
($n_c$ being the number of chunks) the function GMonks is
1; otherwise, it is 0.
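The three target functions can be sketched in Python as follows. This is only an illustration under stated assumptions: features are passed as integer sequences, Boolean features use 0/1, nominal values are encoded as 1, 2, 3, ... in the order described above, and the inequalities in P3 are read as "different from". All function names are hypothetical.

```python
from typing import Sequence


def parity(x: Sequence[int]) -> int:
    """1 if an odd number of features equal 1, else 0."""
    return sum(1 for v in x if v == 1) % 2


def disjunction(x: Sequence[int]) -> int:
    """1 if the first n' features are all 1, or the remaining ones are all 1."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two features are assumed")
    n_prime = n // 2 if n % 2 == 0 else n // 2 + 1
    return int(all(v == 1 for v in x[:n_prime]) or all(v == 1 for v in x[n_prime:]))


def gmonks_chunk(c: Sequence[int]) -> bool:
    """Evaluate P2 and not (P1 and P3) on one chunk of 6 nominal features."""
    p1 = c[0] == c[1] or c[4] == 1
    p2 = sum(1 for v in c if v == 1) >= 2
    # ASSUMPTION: the second clause of P3 uses "not equal" comparisons.
    p3 = (c[4] == 3 and c[3] == 1) or (c[4] != 3 and c[1] != 2)
    return p2 and not (p1 and p3)


def gmonks(x: Sequence[int]) -> int:
    """1 if the chunk condition holds for at least (number of chunks) div 2 chunks."""
    if len(x) == 0 or len(x) % 6 != 0:
        raise ValueError("GMonks needs a positive multiple of 6 features")
    chunks = [x[i:i + 6] for i in range(0, len(x), 6)]
    satisfied = sum(gmonks_chunk(c) for c in chunks)
    return int(satisfied >= len(chunks) // 2)
```

Read literally, the threshold is 0 when there is a single chunk, so this sketch returns 1 for every input of exactly 6 features; whether that is intended for the smallest GMonks instance cannot be settled from the description alone.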



6.3 Experimental setup 

The experiments are divided into three main groups. The
first group explores the relationship between irrelevance
and relevance. The second one explores the relationship
between redundancy and relevance. The last group studies
the effect of different sample sizes. Each group uses three
families of problems (Parity, Disjunction and GMonks) with
four different instances for each one, varying the number of
relevant features $N_R$, as indicated:

Relevance: The different numbers $N_R$ vary for each problem,
as follows: {4, 8, 16, 32} (for Parity), {5, 10, 15, 20}
(for Disjunction) and {6, 12, 18, 24} (for GMonks).
Irrelevance: In these experiments, $N_I$ runs from zero to
twice the value of $N_R$. Specifically,
$N_I \in \{(k \cdot N_R)/p,\ k = 0, 1, \ldots, 10\}$ (that is, eleven
different experiments of irrelevance for each $N_R$). The value of
$p$ is chosen so that all the involved quantities are integer:
$p = 4$ for Parity, $p = 5$ for Disjunction and $p = 6$ for GMonks.
Redundancy: Analogously to the generation of irrelevant
features, we have $N_{R'}$ running from zero to twice the value
of $N_R$ (eleven experiments of redundancy for each $N_R$).
Sample Size: Given the formula $|S| = \alpha k N_T c$ (see ij5]),
different problems were generated considering
$k \in \{0.25, 0.5, 0.75, 1.0, 1.25, 1.75, 2.0\}$,
$N_T = N_R + N_I + N_{R'}$, $c = 2$ and $\alpha = 20$. The values of
$N_I$ and $N_{R'}$ were fixed as $N_I = N_{R'} = N_R \text{ div } 2$.
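The experimental grid described above can be enumerated with a few lines of Python. This is a sketch that simply evaluates the two formulas as stated; the function names are hypothetical, and rounding the sample size to an integer is an assumption, since nothing is said about how fractional sizes are handled.

```python
def irrelevance_levels(n_rel: int, p: int) -> list:
    """N_I = (k * N_R) / p for k = 0, 1, ..., 10."""
    levels = []
    for k in range(11):
        numerator = k * n_rel
        if numerator % p != 0:
            raise ValueError("p must make every level an integer")
        levels.append(numerator // p)
    return levels


def sample_size(k: float, n_total: int, alpha: int = 20, c: int = 2) -> int:
    """|S| = alpha * k * N_T * c, rounded to a whole number of examples."""
    # ASSUMPTION: fractional results are rounded to the nearest integer.
    return round(alpha * k * n_total * c)
```

For example, `sample_size(0.25, 60)` gives 600 examples, which matches the sample size quoted for the variability experiment on a problem with 60 features in total.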

6.4 Discussion of the results 

Due to space reasons, only a representative sample of the
results is presented, in graphical form, in Figs. 2 and 3.
In all the plots, each point represents the average of 10
independent runs with different random data samples.
Figs. 2(a) and (b) are examples of irrelevance vs. relevance
for four instances of the problems, (c) and (d) are examples
of redundancy vs. relevance and (e) and (f) of sample size
experiments. In all cases, the horizontal axis represents
the ratios between these particulars as explained above.
The vertical axis represents the average results given by
the score criterion.

• In Fig. 2(a) the C-Sbg algorithm shows at first a good
performance but then falls dramatically (below the
0.5 level from $N_I = N_R$ on) as the irrelevance ratio
increases. Note that for $N_R = 4$ performance is perfect
(the plot is on top of the graphic). In contrast, in
Fig. 2(b) the Relief algorithm presents very similar
and fairly good results for the four instances of the
problem, being almost insensitive to the total number
of features.

• In Fig. 2(c) the Lvf algorithm presents a very good
and stable performance for the different problem
instances of Parity. In contrast, in Fig. 2(d) Qbb tends
to a poor general performance in the Disjunction problem
when the total number of features increases.

• The plots in Figs. 2(e) and (f) show additional
interesting results because we can appreciate the curse
of dimensionality (Jain and Zongker, 1997). In these
figures, Lvi and W-Sfg perform increasingly poorly
(see the figure from top to bottom) with higher numbers
of features, provided the number of examples is
increased in a linear way. However, in general, as long
as more examples are added, performance is better
(left to right).

Figure 2: Selected results of the experiments: (a), (b) are examples of irrelevance vs. relevance, (c), (d) of redundancy
vs. relevance and (e), (f) of sample size experiments. The horizontal axis is the ratio between these quantities. The
vertical axis is the average result given by the score in 10 independent runs with different random data samples.

A summary of the complete results is displayed in Fig. 3
for the ten algorithms, allowing for a comparison across all
the sample data sets with respect to each studied particular.
Specifically, Figs. 3(a), (c) and (e) show the average
score of each algorithm for irrelevance, redundancy and
sample size, respectively. Moreover, Figs. 3(b), (d) and
(f) show the same average weighted by $N_R$, in such a way
that more weight is assigned to more difficult problems
(higher $N_R$). In each graphic there are two keys: the key
to the left shows the algorithms ordered by total average
performance, from top to bottom. The key to the right
shows the algorithms ordered by average performance on
the last abscissa value, also from top to bottom. In other
words, the left list is topped by the algorithm that wins
on average, while the right list is topped by the algorithm
that ends in the lead. This also helps in reading
the graphics.
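The difference between the plain and the weighted average can be made concrete with a short sketch. It assumes that each problem instance contributes one score and that the weight of an instance is simply its number of relevant features; the function name and the scores in the usage note are hypothetical and are not experimental results.

```python
def plain_and_weighted_mean(score_by_nr: dict) -> tuple:
    """Return (unweighted mean, N_R-weighted mean) of per-instance scores.

    `score_by_nr` maps the number of relevant features N_R of an instance
    to the score obtained on that instance.
    """
    if not score_by_nr:
        raise ValueError("need at least one problem instance")
    plain = sum(score_by_nr.values()) / len(score_by_nr)
    total_weight = sum(score_by_nr)  # sum of the N_R keys
    weighted = sum(n * s for n, s in score_by_nr.items()) / total_weight
    return plain, weighted
```

With made-up scores of 1.0 on an instance with 4 relevant features and 0.5 on one with 8, the plain mean is 0.75 while the weighted mean drops to 2/3, because the harder instance counts double.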

• Fig. 3(a) shows that Relief ends up in the lead on the
irrelevance vs. relevance problems, while Sffg shows
the best average performance. The algorithm W-Sfg
is also well positioned.

• Fig. 3(c) shows the algorithms Lvf and Lvi, together
with C-Sbg, as the overall best. In fact, there is a
group of algorithms, which also includes the two floating
ones and Qbb, showing a close performance. Note how
Relief and the wrappers are very poor performers.

• Fig. 3(e) shows how the wrapper algorithms extract
the most from the data when there is a shortage of it.
Surprisingly, the backward wrapper is just fairly
positioned on average. The Sffg algorithm is again quite
good on average, together with C-Sbg. However, all
of the algorithms are quite close and show the same
kind of dependency on the amount of available data.
Note the generally poor performance of E-Sfg, most
likely due to the fact that it is the only algorithm
that computes its evaluation measure (entropy in this
case) independently for each feature.

The weighted versions of the plots (Figs. 3(b), (d) and (f))
do not seem to alter the picture very much. A closer look
reveals that the differences between the algorithms have
widened. Very interesting is the change for Relief, which
takes the lead both on irrelevance and sample size, but not
on redundancy.

6.5 General considerations 

The results point to Sffg as the best algorithm on average
in complete ignorance of the particulars of the data
set, or whenever one is willing to use a single algorithm.
However, in view of the reported results, a better strategy
would be to run various algorithms in a coupled way (i.e.,
in different execution orders and piping the respective
solutions) and observe the results. Specifically, we suggest
using Relief when one is interested in detecting irrelevance,
Lvf for detecting redundancy and W-Sfg in the presence of
small sample size situations. In light of this, we conjecture
that Sffg used in a wrapper fashion could be a better
one-size-fits-all option for small to moderate size problems.
We would like to bring to attention the following points:

1. The wild differences in performance for different al- 
gorithms and data particulars: fixing an algorithm A 
and a problem P, performance of A is dramatically 
different for the various particulars considered (but in 
a consistent way in all instances of P). However, these 
results are coherent and scale quite well for increasing 
numbers of relevant features. 

2. The score criterion seems to reliably capture what in- 
tuition tells about the quality of a solution at this 
simple level. 

We would also like to emphasize the fact that the differences
in the outcome yielded by the algorithms are not
entirely due to their different approach to the problem.
Rather, they are also attributable to the lack of a precise
optimization goal, for example in the form described
in Definition 1. Another important factor is the finite (and
possibly very limited) sample size which, on the one hand,
hinders obtaining an accurate evaluation of relevance.
On the other, the dependence on a specific sample
reminds us that every evaluation of relevance in a feature
subset should be regarded as the outcome of a random
variable, different samples yielding different outcomes. In
this vein, the use of resampling techniques like Random
Forests (Breiman) is strongly recommended.

A final interesting point is the relation between the eval- 
uation given by a specific inducer and the score. We were 
interested in ascertaining whether higher inducer evalua- 
tions imply higher scores. We next provide evidence that 
this need not be the case by means of a counterexample. 

Conjecture: given a FSA and the solution it yields in a 
data set, we know this solution is suboptimal in the sense 
that better solutions may exist but are not found. How- 
ever, we would expect the solution to be better (i.e. have 
a higher score) the better its performance is. 

Experiment: we ran W-Sfg in 10 independent runs
with different random data samples of size 600 using Naive
Bayes as inducer in an instance of the GMonks problem,
described by $N_R = 24$, $N_{R'} = 12$ and $N_I = 24$ for $N_T = 60$.
Table 1 shows the results: for each run, the final inducer
performance is given, as well as the score of the solutions.
Runs 5 and 8 correspond to very different solutions (number
5 being much better than number 8) that have almost
the same inducer evaluation. Run 5 also has a lower
evaluation than run 9, but a greater score.
Figure 3: Results ordered by total average performance on the data sets (left inset) and by end performance (right
inset). Figs. (b), (d) and (f) are weighted versions of (a), (c) and (e), respectively.

Table 1: Results of the experiment on variability.



This same experiment can be used to show the variabil- 
ity in the results as a function of the data sample. It can 
be seen that the numbers of relevant and redundant as well 
as irrelevant features depend very much on the sample. A 
look at the precise features chosen reveals that they are 
very different solutions (a fact that is also indicated by the 
score) that nonetheless give a similar evaluation by the
inducer. Given the incremental nature of W-Sfg, it can be
deduced that classifier improvements were obtained by
adding completely irrelevant features.



7 CONCLUSIONS 



The task of a feature selection algorithm (FSA) is to
provide a computational solution to the feature selection
problem motivated by a certain definition of relevance or,
at least, by a performance evaluation measure. This
algorithm should also be increasingly reliable with sample
size and pursue the solution of a clearly stated optimiza- 
tion goal. The many algorithms proposed in the literature 
are based on quite different principles and loosely follow 
these recommendations, if at all. In this research, several 
fundamental algorithms have been studied to assess their 
performance in a controlled experimental scenario. A mea- 
sure to evaluate FSAs has been devised that computes the 
degree of matching between the output given by a FSA and 
the known optimal solution. This measure takes into ac- 
count the particulars of relevance, irrelevance, redundancy 
and size of synthetic data sets. 

Our results illustrate the pitfall of relying on a single
algorithm and sample data set, especially when there is
poor knowledge available about the structure of the
solution or the sample data size is limited. The results also
illustrate the strong dependence on the particular condi- 
tions in the data set description, namely the amount of 
irrelevance and redundancy relative to the total number 
of features. Finally, we have shown by a simple example 
how the evaluation of a feature subset can be misleading 
even when using a reliable inducer. All this points in the 
direction of using hybrid algorithms (or principled com- 
binations of algorithms) as well as resampling for a more 
reliable assessment of feature subset performance. 



This work can be extended in many ways, to carry out
more general evaluations (considering richer forms of
redundancy) and using other kinds of data (e.g., continuous
data). A specific line of research is the corresponding
extension of the scoring criterion.
Appendix

Proposition. The score fulfills the two conditions:

a) $S_X(A) = 0 \iff A = X_I$

b) $S_X(A) = 1 \iff A = X^*$

Proof:

a) ($\Leftarrow$) Let $A = X_I$. Then $S_X(A) = S_X(X_I)$; since
$A_I = X_I \cap A = X_I \cap X_I = X_I$, we have $I_A = 0$; since
$A_R = A_{R'} = \emptyset$ we have $R_A = R'_A = 0$. Thus $S_X(X_I) = 0$.

($\Rightarrow$) Suppose $S_X(A) = 0$; since all terms that make up
$S_X(A)$ are non-negative, it is necessary that all of them are zero.
Now $R_A = 0$ implies $A_R \cup A_{R'} = \emptyset$, which implies
$A_R = A_{R'} = \emptyset$; then $A = A_I$, thus $A = A \cap X_I$ and
hence $A \subseteq X_I$. Since $I_A$ must be zero, $|A_I| = |X_I|$ and
therefore $A = X_I$.

b) Suppose $A = X^*$; then it can be checked that
$R_A = R'_A = I_A = 1$ and thus
$\alpha_R R_A + \alpha_{R'} R'_A + \alpha_I I_A = \alpha_R + \alpha_{R'} + \alpha_I = 1$.
Now this is the only way to achieve this value, since any other
situation $A \neq X^*$ leads to either $R_A$, $R'_A$ or $I_A$ being
less than 1.



